16-1 Landmark Extraction

The goal of audio fingerprinting is to identify the source of a given noisy music recording. The query clip is assumed to come from an original recording, possibly corrupted by channel distortion and environmental noise. Audio fingerprinting has been successfully applied to audio-based search, with commercial products such as Shazam, SoundHound, IntoNow, and so on. This section describes one of the most widely adopted algorithms for audio fingerprinting.

One robust feature that can resist both channel distortion and environmental noise is the set of salient peaks in a spectrogram, defined as local maxima along both the time and frequency axes. In practice, since there are far too many raw local maxima, we usually apply some smoothing to extract the truly salient ones. Here is an example of extracting salient peaks after smoothing:

Example 1: landmarkFind01.m

	addpath D:\users\jang\matlab\toolbox\audioFingerprinting
	waveFile = 'bad_romance_short.wav';
	au = myAudioRead(waveFile);
	y = au.signal;
	fs = au.fs
	afpOpt = afpOptSet0;
	[landmarkVec, specMat, threshold1, peakTable1] = afpFeaExtract(y, fs, afpOpt, 1);

Output:

	fs = 8820
	afpFeaExtract: 40 sec, 1249 cols, 786 peakTable, 349 bwd-pruned peakTable, 836 lmarks
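For readers without MATLAB, the peak-picking idea can be sketched in Python. This is an illustrative re-implementation, not the course toolbox: the neighborhood size, smoothing width, and dB threshold below are all assumed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def find_salient_peaks(spec_db, neighborhood=(15, 15), min_db_above=10.0):
    """Return (freq_idx, time_idx) arrays of salient peaks in a dB spectrogram.

    A point is kept if it equals the maximum of its local neighborhood
    (a local max along both time and frequency) AND stands at least
    `min_db_above` dB over a smoothed background estimate, which serves
    as the smoothing step that discards insignificant local maxima.
    """
    background = gaussian_filter(spec_db, sigma=4.0)          # smoothed floor
    local_max = maximum_filter(spec_db, size=neighborhood) == spec_db
    salient = local_max & (spec_db > background + min_db_above)
    return np.nonzero(salient)
```

Without the smoothed-background threshold, every point of a flat region would count as a local maximum; the threshold is what keeps only the truly salient peaks.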

Note that the smoothing is applied along both the time and frequency axes. After the salient peaks are identified, we can pair them to form landmarks. This is done by defining a rectangular region (called the target zone) right after each peak; the leading peak is then paired with each peak inside its target zone to form a landmark.
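The pairing step can be sketched as follows. This is a minimal Python illustration; the target-zone bounds (dt_min, dt_max, df_max) and the fan-out limit are assumed values, not the toolbox defaults.

```python
def pair_landmarks(peaks, dt_min=1, dt_max=63, df_max=31, fanout=3):
    """Pair each peak with up to `fanout` later peaks in its target zone.

    `peaks` is a list of (t, f) pairs sorted by time.  The target zone of
    a leading peak (t1, f1) covers times t1+dt_min .. t1+dt_max and
    frequencies within +/- df_max of f1.  Each accepted pair becomes a
    landmark (t1, f1, t2, f2).
    """
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > dt_max:
                break                        # peaks are time-sorted
            if t2 - t1 >= dt_min and abs(f2 - f1) <= df_max:
                landmarks.append((t1, f1, t2, f2))
                paired += 1
                if paired == fanout:
                    break
    return landmarks
```

The fan-out limit bounds the number of landmarks per leading peak, trading database size against robustness.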

Once we have the landmarks (pairs of salient peaks), we need a representation that allows the query landmarks to be compared with the database landmarks efficiently. This is achieved through hashing: the database landmarks are stored in a hash table so that, once the query landmarks are obtained, the relevant database landmarks can be located and compared rapidly. More specifically, each landmark can be represented by the coordinates of its two peaks, [t1, f1, t2, f2], where [t1, f1] and [t2, f2] are the coordinates of the first and second peaks, respectively. From these coordinates we can define a 24-bit hash key, which is used to index a hash table of 2^24 entries. Each entry holds a list of hash values, each composed of a song ID and a landmark offset time. To use the hash value effectively, we usually combine the song ID and the landmark offset time into a single value by the formula:
hashValue = songId*timeSize + offsetTime
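To make the hashing concrete, here is a hedged Python sketch. The bit-field layout of the 24-bit key, based on f1, Δf = f2 − f1, and Δt = t2 − t1, and the timeSize constant are illustrative assumptions; only the songId*timeSize + offsetTime combination comes from the formula above.

```python
TIME_SIZE = 16384           # max offset frames per song (illustrative value)

def make_hash_key(t1, f1, t2, f2, f_bits=12, df_bits=6, dt_bits=6):
    """Pack a landmark into a 24-bit hash key with fields [f1 | df | dt].

    Field widths are illustrative assumptions.  df is offset-encoded so
    that negative frequency differences still fit in an unsigned field.
    """
    df = (f2 - f1) + (1 << (df_bits - 1))   # offset-encode the sign
    dt = t2 - t1
    assert 0 <= f1 < (1 << f_bits)
    assert 0 <= df < (1 << df_bits) and 0 <= dt < (1 << dt_bits)
    return (f1 << (df_bits + dt_bits)) | (df << dt_bits) | dt

def make_hash_value(song_id, offset_time):
    """Combine song ID and landmark offset time into a single integer."""
    return song_id * TIME_SIZE + offset_time

def split_hash_value(value):
    """Recover (song_id, offset_time) from a combined hash value."""
    return divmod(value, TIME_SIZE)
```

Because the combination is a simple multiply-and-add, the song ID and offset time can be recovered exactly by integer division and remainder.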
For landmark extraction from the database songs, the frame shift is 256 samples (equal to the frame size minus the overlap). However, when extracting landmarks from the query clip, it is unlikely that the beginning of the query aligns closely with the frame boundaries of the database songs. The best we can do is to reduce the frame shift of the query to obtain a finer alignment; in our implementation, the query frame shift is usually 64 or 32 samples. Reducing the frame shift gives better alignment (and thus better accuracy), but it also increases the frame rate and the number of landmarks, which leads to longer computation time.
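This trade-off can be quantified with a little arithmetic. In the Python sketch below, the frame size of 512 samples and the 40-second clip length are assumed for illustration; the sample rate of 8820 Hz is taken from Example 1.

```python
def num_frames(n_samples, frame_size, frame_shift):
    """Number of complete analysis frames for a given frame shift (hop)."""
    return (n_samples - frame_size) // frame_shift + 1

fs = 8820                      # sample rate, as in Example 1
n = 40 * fs                    # a 40-second clip (illustrative)
frame_size = 512               # assumed frame size
for shift in (256, 64, 32):
    frames = num_frames(n, frame_size, shift)
    worst_ms = shift / 2 / fs * 1000     # worst-case misalignment, in ms
    print(f"shift={shift:3d}: {frames:5d} frames, "
          f"worst-case misalignment {worst_ms:.1f} ms")
```

Cutting the frame shift from 256 to 64 roughly quadruples the number of query frames (and hence landmarks and computation), in exchange for reducing the worst-case frame misalignment by the same factor.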
Audio Signal Processing and Recognition (音訊處理與辨識)